In this document we walk through how to access, analyze, and work with data exported from the AtoMx™ Spatial Informatics Platform (SIP). Specifically, data are stored as both a standalone Seurat object and an equivalent TileDB array. The exported data include: transcript counts/location, field of view images, annotation metadata, and user-initiated data transformations performed in AtoMx SIP prior to export. Applying foundational single-cell spatial analysis techniques, we illustrate the benefits enabled by leveraging the computational efficiency of TileDB and the flexibility of the Seurat objects created for CosMx SMI data.
Given the rapid evolution of CosMx SMI data analysis tools, the guidance in this vignette may differ from other published user documents accompanying the platform. Please speak to your NanoString Applications Scientist to discuss your particular project needs.
The export module can be obtained from a NanoString Applications Scientist or from the website. It is loaded into the CosMx SMI Data Analysis Suite in AtoMx SIP. For user documentation related to these platforms, please refer to the Data Export Guide or the User Manual.
The export module can be used at any point in a pipeline on AtoMx. All results up to the point of export will be in the Seurat object and TileDB array. These are equivalent outputs in different formats. We show below how to interact with both objects: TileDB, Seurat.
This format is the same for RNA and Protein studies but the format of the raw files folder will be different depending on the analyte.
It is recommended to only export Raw Files once per study. If multiple pipelines from the same study export Raw File there could be TBs of duplicated files. For this study, the zipped raw files are over 800 GB. These files are not zipped automatically in the export module. An explanation of all important raw files can be found here.
Due to the file size of these data, the export module gives the choice of what to export: TileDB array, Seurat object, and/or Raw Files.
After exporting, depending on parameters selected for export, the data folder can contain:
FullSeuratObject == TRUE)SWTA20230803015150 and 14AUG23_Val12_PTL6316_Exp8_1kplex37CPA_SI_S2We will continue this vignette with data from the Liver Public Data Release.
Two tissue sections from 5µm Formalin-fixed Paraffin Embedded (FFPE) sections:
Scan areas:
Fields of View (FOV)
Panel Used: CosMx Human Universal Cell Characterization Panel (1000-plex)
Segmentation markers:
library(ggplot2)
library(ggpubr)
library(ggrepel)
library(gridExtra)
library(matrixStats)
library(patchwork)
library(pheatmap)
library(Seurat)
library(RColorBrewer)
library(reshape2)
“TileDB is a powerful engine for storing and accessing dense and sparse multi-dimensional arrays, which can help you model any complex data efficiently.” TileDB GitHub
In addition to formats which store data in memory, the AtoMx SIP exports support TileDB formats which access and load data as memory is needed. This format, while less commonly used for single-cell projects, allows for scalability to very high-density analysis. As many CosMx studies will be well in excess of 1 million cells, this new format will enable robust and scalable computations across very large studies without requiring all data to be loaded into memory. CosMx SMI datasets can be large enough to quickly run out of RAM when doing analyses if the entire dataset is read into memory. By saving data in TileDB arrays we don’t need to have all of the data in memory, only the specific parts we are actively working on.
TileDBsc is an R implementation of the Stack of Matrices, Annotated (SOMA) API based on TileDB. Using the tiledbsc R package, one can access the data stored as TileDB arrays in the SOMA data model from our analysis pipeline. The TileDB organization, in association with the Chan Zuckerberg Initiative, is currently creating a supplemental library that will enable users to convert the two most popular single cell format, Seurat and AnnData, to and from SOMA to make data analysis in multiple languages like R and python simple. While all of the data are stored in TileDB arrays, it can be saved into a Seurat object to work in a more familiar format and to be able to use existing Seurat functions, see the Seurat Object section below for more information.
For this section of the tutorial, we will be focusing on two packages: TileDB-R and tiledbsc. TileDB-R allows us to use the low-level functions from TileDB like reading and writing matrices and tiledbsc is built on top of this framework specifically for single cell data that allows easy conversion to a Seurat object. f
if (!require("tiledb", quietly = TRUE))
remotes::install_github("TileDB-Inc/TileDB-R", force = TRUE,
ref = "0.17.0")
if (!require("tiledbsc", quietly = TRUE))
remotes::install_github("tiledb-inc/tiledbsc", force = TRUE,
ref = "8157b7d54398b1f957832f37fff0b173d355530e")
library(tiledb)
library(tiledbsc)
These packages are still under development but these are the stable versions used in the AtoMx pipeline.
tiledb: 0.17.0
tiledbsc: 0.1.4.9001
Set the tiledbURI variable below to point to the file path of the TileDB folder. For the example below, the TileDB folder is located in LiverDataRelease_TileDBArray folder. You can also set this to an Amazon Web Serivices (AWS) S3 bucket if your TileDB object is stored in an S3 bucket.
# File path to TileDB array on local machine
tiledbURI <- "LiverDataRelease_TileDBArray/"
If your data are in an Amazon S3 bucket (which is currently required for export), we need to set up the environment for TileDB to be able to interact with AWS. Depending on your AWS setup you might need to add your access key, secret key, and session token here as well. Specific instructions on setting up the AWS account can be found here.
# AWS region of the URI
s3Region <- "us-west-2"
aws_creds <- rjson::fromJSON( paste(system2("aws",
args = c("configure",
"export-credentials"),
stdout = TRUE), collapse = ""))
# configure tiledb S3
cfg <- tiledb_config()
cfg["vfs.s3.region"] <- s3Region
cfg["vfs.s3.aws_access_key_id"] <- aws_creds$AccessKeyId
cfg["vfs.s3.aws_secret_access_key"] <- aws_creds$SecretAccessKey
cfg["vfs.s3.aws_session_token"] <- aws_creds$SessionToken
ctx <- tiledb_ctx(cfg)
The main object in TileDB is a Stack of Matrices, Annotated (SOMA). SOMA is an open data model for representing annotated matrices, like those commonly used for single-cell data analysis. A SOMACollection contains multiple SOMAs. In CosMx SMI data, there is a SOMA for targets RNA species, negative control probes, and FalseCodes.
For protein datasets, count data are currently stored in RNA SOMA. We are currently investigating creating a separate SOMA specifically for protein and may come in a latter release.
An example of the basic TileDB data structure can be seen below:
directory
|-- __group
| |-- __data_specific_alphanumeric_name_here
| |-- __data_specific_alphanumeric_name_here
|-- __meta
| |-- __data_specific_alphanumeric_name_here
|-- __tiledb_group.tdb
|-- soma_RNA
| |-- X
| |-- __group
| |-- __meta
| |-- __tiledb_group.tdb
| |-- obs
| |-- obsm
| |-- obsp
| |-- uns
| |-- var
| |-- varm
| |-- varp
|-- uns
|-- __group
|-- __meta
|-- __tiledb_group.tdb
|-- commands
These files are not touched except through TileDB accessor functions, shown below.
You can read the arrays stored in the TileDB object using the tiledbsc R package which offers accesor functions to easily read the data into memory or access only.
# read in SOMACollection
tiledb_scdataset <- tiledbsc::SOMACollection$new(uri = tiledbURI,
verbose = FALSE)
The TileDB object is a collection of pointers to the RNA counts, normalized RNA counts, negative control probes, and falsecode SOMAs. Each SOMA follows the AnnData shape.
For protein datasets, count data are currently stored in RNA SOMA. We are currently investigating creating a separate SOMA specifically for protein and may come in a later release.
AnnData Shape
names(tiledb_scdataset$somas)
## [1] "falsecode" "negprobes" "RNA" "RNA_normalized"
names(tiledb_scdataset$somas$RNA$members)
## [1] "X" "varp" "obs" "varm" "obsp" "uns" "obsm" "var"
Having an object of pointers allows for small memory usage. This TileDB array with an entire study only used 3.23 MB of memory even though it contains nearly 800,000 cells!
The main parts of CosMx SMI data are:
X slotobs slotobsm slotThese parts of the data can be loaded into memory separately or all together.
Individual TileDB objects might have different orders of cells. Please pay attention to these orders.
All matrices in TileDB are stored as sparse matrices. This structure is used for querying the data.
Count matrices currently have targets on rows and cells on columns like a standard Seurat object. In future TileDB releases, these will be transposed to look more like AnnData. This change will decrease load times in TileDB but any coercion to Seurat, shown below, will not be affected.
The raw data cell-by-target expression data are stored in the X slot and can be retrieved and stored in memory using the below code.
# batch_mode parallelizes the readin, decreasing computation time
counts <- tiledb_scdataset$somas$RNA$X$members$counts$to_matrix(batch_mode = TRUE)
dim(counts)
## [1] 1000 793318
counts[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgTMatrix"
## c_1_100_1000 c_1_100_1017 c_1_100_1028 c_1_100_1043
## AATK 1 1 2 1
## ABL1 . . . .
## ABL2 . . . .
## ACACB . . . .
Memory used: 2.4 GB
Use the code below to retrieve and save the normalized counts in memory.
These data were normalized using Pearson Residuals to account for library size factors to ensure that cell specific total transcript abundance and distribution of counts, which may vary some between FOVs and samples. See Normalization Section for more details.
In future releases normalized data will be located in tiledb_scdataset$somas$RNA$X$members$data and not in a separate SOMA.
norm <- tiledb_scdataset$somas$RNA_normalized$X$members$data$to_matrix(batch_mode = TRUE)
dim(norm)
## [1] 1000 793318
norm[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgTMatrix"
## c_1_100_1 c_1_100_10 c_1_100_100 c_1_100_1000
## AATK -0.1429034 -0.06889529 -0.2645790 7.5780320
## ABL1 -0.1972264 -0.09509172 -0.3650734 -0.1790312
## ABL2 -0.1667603 -0.08039924 -0.3087212 -0.1513743
## ACACB -0.2951553 -0.14233400 -0.5460241 -0.2679370
Memory used: 12.75 GB
Cell-level metadata are stored in obs slot. It consists of cell information read from the output of CosMx SMI. Some worth noting are the centroid x-y locations, flow cell, fov and other cell stats from the Cell_Stats csv files from raw data inputs. The obs slot will also store results from modules run from the pipeline. Some examples are the QC module and clustering modules. The code below can be used to retrieve and to store the cell metadata from obs slot into memory.
metadata <- tiledb_scdataset$somas$RNA$obs$to_dataframe()
dim(metadata)
## [1] 793318 67
metadata[1:4,1:10]
## orig.ident nCount_RNA nFeature_RNA nCount_negprobes
## c_1_100_1 c 142 77 0
## c_1_100_10 c 33 26 0
## c_1_100_100 c 487 143 0
## c_1_100_1000 c 117 69 0
## nFeature_negprobes RNA_pca_cluster_default
## c_1_100_1 0 16
## c_1_100_10 0 13
## c_1_100_100 0 3
## c_1_100_1000 0 13
## RNA_pca_cluster_default.1 nCount_falsecode nFeature_falsecode fov
## c_1_100_1 20 0 0 100
## c_1_100_10 15 5 5 100
## c_1_100_100 7 1 1 100
## c_1_100_1000 16 2 2 100
Memory used: 447.44 MB
If you do not want all the cell metadata to be read into memory, you can read specific columns by specifying the attrs parameter. We can use this metadata to visualize the tissue structure by specifically plotting where each cell is in physical coordinate space.
head(tiledb_scdataset$somas$RNA$obs$attrnames())
## [1] "orig.ident" "nCount_RNA"
## [3] "nFeature_RNA" "nCount_negprobes"
## [5] "nFeature_negprobes" "RNA_pca_cluster_default"
cellCoords <- tiledb_scdataset$somas$RNA$obs$to_dataframe(
attrs = c("x_FOV_px", "y_FOV_px", "x_slide_mm", "y_slide_mm",
"slide_ID_numeric", "Run_Tissue_name", "fov"))
head(cellCoords)
## x_FOV_px y_FOV_px x_slide_mm y_slide_mm slide_ID_numeric
## c_1_100_1 42 36 8.70804 9.73368 1
## c_1_100_10 2737 25 9.03144 9.73500 1
## c_1_100_100 2888 409 9.04956 9.68892 1
## c_1_100_1000 3457 3742 9.11784 9.28896 1
## c_1_100_1001 1298 3763 8.85876 9.28644 1
## c_1_100_1002 1001 3738 8.82312 9.28944 1
## Run_Tissue_name fov
## c_1_100_1 NormalLiver 100
## c_1_100_10 NormalLiver 100
## c_1_100_100 NormalLiver 100
## c_1_100_1000 NormalLiver 100
## c_1_100_1001 NormalLiver 100
## c_1_100_1002 NormalLiver 100
ggplot(cellCoords, aes(x=x_slide_mm, y=y_slide_mm))+
geom_point(alpha = 0.05, size = 0.01)+
facet_wrap(~Run_Tissue_name)+
coord_equal()+
labs(title = "Cell coordinates in XY space")
Transcript coordinates are x-y locations of individual transcript. Note that one cell contains many transcripts. These are currently stored in obsm slot. In future releases transcript coordinates will be stored in the uns slot: tiledb_scdataset$somas$RNA$uns$members$transcriptCoords
To read the entire transcript coordinate data frame into memory:
transcriptCoords <- tiledb::tiledb_array(
tiledb_scdataset$somas$RNA$obsm$members$transcriptCoords$uri,
return_as="data.frame")[]
head(transcriptCoords)
In many cases, having all of the transcript coordinates in memory is too large. TileDB allows for querying of the data to significantly decrease the amount of data read into memory. Here we only want the transcripts in FOV 15 to plot later.
fov <- 15
qc <- tiledb_query_condition_init(attr = "fov",
value = fov,
dtype = "INT32",
op = "EQ")
transcriptCoords <- tiledb_array(tiledb_scdataset$somas$RNA$obsm$members$transcriptCoords$uri,
query_condition=qc, return_as="data.frame")[]
head(transcriptCoords)
## slideID fov x_FOV_px y_FOV_px z_FOV_slice target CellId cell_id
## 1 1 15 1375 9 1 HMGN2 1271 c_1_15_1271
## 2 1 15 25 10 6 MALAT1 1 c_1_15_1
## 3 1 15 25 10 5 XBP1 1 c_1_15_1
## 4 1 15 36 10 3 SLCO2B1 1 c_1_15_1
## 5 1 15 167 10 6 RPL32 23 c_1_15_23
## 6 1 15 201 10 7 APOA1 23 c_1_15_23
## CellComp
## 1 Nuclear
## 2 Nuclear
## 3 Nuclear
## 4 Nuclear
## 5 Cytoplasm
## 6 Cytoplasm
One can visualize this by plotting transcripts from an individual sample (flow cell 1, normal liver) and a specific region within the tissue (FOV 15). In the graph below the centroid of each cell is shown as a black point and each transcript is shown as a colored point.
slide <- 1
fov <- 15
slideName <- unique(cellCoords$Run_Tissue_name[cellCoords$slide_ID_numeric ==
slide])
fovCoords <- cellCoords[cellCoords$slide_ID_numeric == slide &
cellCoords$fov == fov,]
fovTranscriptCoords <- transcriptCoords[transcriptCoords$slideID == slide &
transcriptCoords$fov == fov,]
targetCounts <- table(fovTranscriptCoords$target)
targets <- names(targetCounts[which(targetCounts >= 2500 &
targetCounts <= 3000)])
fovTranscriptCoords <- fovTranscriptCoords[fovTranscriptCoords$target %in%
targets,]
ggplot(fovCoords, aes(x=x_FOV_px, y=y_FOV_px))+
geom_point(size = 0.5, color = "black")+
geom_point(data = fovTranscriptCoords,
mapping = aes(x=x_FOV_px,
y=y_FOV_px,
color = target),
size = 0.6, alpha = 0.6)+
theme_bw()+
coord_equal()+
guides(colour = guide_legend(override.aes = list(size=1,
alpha=1)))+
labs(color = "RNA Target", title = paste0("RNA Transcripts in\n",
slideName, "\nFOV", fov))
In addition to storing data about count information, the AtoMx SIP enables downstream analysis as well including methods such as differential expression and ligand-receptor analysis. Metadata from these analyses can also be included in the exported results from AtoMx. See Analysis Results Section for full description of each analysis module that can be run in AtoMx SIP, manual found here.
The results from analysis modules can be found in slots dependent on their output shape. They follow the AnnData object shape shown above. Many of the results names from an AtoMx export will have a globally unique identifier (GUID) attached to their name. GUIDs are used to ensure running multiple modules in the same pipeline don’t overwrite the results. Each module has it’s own GUID meaning that results from the same module run can be grouped.
| slot | type of data | results |
|---|---|---|
| obs | cell metadata | |
| obsm | data on cells requiring more than 1 column | PathwayCellsAUC, dimreduction_pca, dimreduction_approximateumap, latest.fovs, RNA_nbclust_loglikes, RNA_normalized_cluster_normExpr, cellScores, transcriptCoords, RNA_normalized_markers_result, RNA_normalized_top_markers, dimreduction_approximateUMAP_bySlide |
| obsp | paired result with cells on rows and columns | graph_RNA_pca_snn, graph_RNA_pca_nn, spatial_distances, RNA_spatialDE_W |
| var | target metadata | |
| varm | data on targets requiring more than 1 column | dimreduction_pca |
| varp | paired result with targets on rows and columns | RNA_LeesLRes, RNA_nbclust_log_likelyhood_profiles |
| uns | unstructured data, doesn’t have to follow cells or targets | spatEnrichRes, cellTypeScores, cellTypePairs, linkScores, cellProximity |
Transcript Coordinates are currently in obsm but will be moved to uns in a later release
Here is a small example on how to read in two results arrays. All results follow a similar read-in pattern, the only thing that needs to change is the slot and result matrix names:
tiledb_scdataset$somas$RNA$<put slot name here>$members$<put result matrix name here>$to_matrix()
To get the result for PCA, note from the table above, PCA is saved in the obsm slot and the name of the result matrix is dimreduction_pca:
tiledb_scdataset$somas$RNA$obsm$members$dimreduction_pca$to_matrix()
pca <- tiledb_scdataset$somas$RNA$obsm$members$dimreduction_pca$to_matrix()
pca[1:4,1:4]
## PCA_1 PCA_2 PCA_3 PCA_4
## c_1_100_1 -1.6417182 -2.33161620 -2.99827387 -0.1049817
## c_1_100_10 -1.1434024 -0.08528228 -1.04553850 0.3209358
## c_1_100_100 -7.0135208 0.52200934 0.09413956 0.1917221
## c_1_100_1000 -0.2238483 -2.22115908 -0.78119275 1.5374492
neighbs <- tiledb_scdataset$somas$RNA$obsp$members$graph_RNA_pca_nn$to_seurat_graph()
neighbs[1:4,1:25]
## 4 x 25 sparse Matrix of class "dgCMatrix"
##
## c_1_100_1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 . . . . .
## c_1_100_10 . . . . . . . . . . . . . . . . . . . . 1 1 1 1 1
## c_1_100_100 . . . . . . . . . . . . . . . . . . . . . . . . .
## c_1_100_1000 . . . . . . . . . . . . . . . . . . . . . . . . .
The TileDB objects can also be read into Seurat objects. The TileDBsc native to_seurat function does not read in data from slots like uns so some analysis results will not be attached to the Seurat object and need to be read in individually from the TileDB array like shown above.
# Seurat object with all SOMAs
# seurat <- tiledb_scdataset$to_seurat(batch_mode = TRUE)
# seurat
RNA_seurat <- tiledb_scdataset$to_seurat(somas = c("RNA"), batch_mode = TRUE)
RNA_seurat
## An object of class Seurat
## 1000 features across 793318 samples within 1 assay
## Active assay: RNA (1000 features, 0 variable features)
## 3 dimensional reductions calculated: pca, approximateumap, approximateUMAP_bySlide
These Seurat objects can be used in Seurat functions for calculations or visualizations.
RNA_seurat@meta.data <- metadata
Idents(RNA_seurat) <- RNA_seurat$cellType
markers <- FindMarkers(RNA_seurat, ident.1 = unique(Idents(RNA_seurat))[1L],
logfc.threshold = 0.25, test.use = "roc",
only.pos = TRUE)
VlnPlot(RNA_seurat, features = head(rownames(markers), 2),
log = TRUE, pt.size = 0)
While a Seurat object might be easier to use, please be aware that reading an entire dataset into memory can be RAM exhaustive. This Seurat object requires 3.19 GB of RAM vs the TileDB array which only uses 3.27 MB at loading.
TileDB objects can also be read into python.
This section is in R and only used to set up python environment.
# install packages to use python in R
if (!require("rminiconda", quietly = TRUE))
remotes::install_github("hafen/rminiconda")
library(rminiconda)
library(reticulate)
# create python environment
if(!rminiconda::is_miniconda_installed("tiledb_python")){
rminiconda::install_miniconda(version = 3, name = "tiledb_python")
}
# use created python environment
py <- rminiconda::find_miniconda_python("tiledb_python")
reticulate::use_python(py, required = TRUE)
# install necessary python packages into environment
for(i in c("tiledb", "anndata", "tiledbsoma==0.1.19",
"scanpy", "gc-python-utils")){
rminiconda::rminiconda_pip_install(pkg_name = i, name = "tiledb_python")
}
The rest of this section is in python.
import tiledb
import tiledbsoma
from anndata import AnnData
import scanpy as sc
tiledbURI = "LiverDataRelease_TileDBArray/"
s3Region = "us-west-2"
# set up s3 environment
config = tiledb.Config()
config.update({"vfs.s3.region" : s3Region})
# config.update({"vfs.s3.aws_access_key_id" : ""})
# config.update({"vfs.s3.aws_secret_access_key" : ""})
# config.update({"vfs.s3.aws_session_token" : ""})
ctx = tiledb.Ctx(config)
# read in SOMACollection
pySoma = tiledbsoma.SOMACollection(tiledbURI, ctx=ctx)
pySoma.keys()
## ['RNA_normalized', 'RNA', 'negprobes', 'falsecode', 'uns']
All matrices in TileDB are stored as sparse matrices. This structure is used for querying the data.
Count matrices currently have targets on rows and cells on columns like a standard Seurat object. In future TileDB releases, these will be transposed to look more like AnnData. This change will decrease load times in TileDB but any coercion to Seurat will not be affected.
counts = pySoma['RNA'].X['counts'].csr()
counts
## <793318x1000 sparse matrix of type '<class 'numpy.float64'>'
## with 146503383 stored elements in Compressed Sparse Row format>
These data were normalized using Pearson Residuals to account for library size factors to ensure that cell specific total transcript abundance and distribution of counts, which may vary some between FOVs and samples. See Normalization Section for more details.
In future releases normalized data will be located in pySoma['RNA'].X['data'] and not in a separate SOMA.
norm = pySoma['RNA_normalized'].X['data'].csr()
norm
## <793318x1000 sparse matrix of type '<class 'numpy.float64'>'
## with 793317999 stored elements in Compressed Sparse Row format>
Cell-level metadata are stored in obs slot. It consists of cell information read from the output of CosMx SMI. Some worth noting are the x-y locations, flow cell, fov and other cell stats from the Cell_Stats csv files from raw data inputs. The obs slot will also store results from modules run from the pipeline. Some examples are the QC module and clustering modules. The code below can be used to retireve and to store the cell metadata from obs slot into memory.
obs = pySoma['RNA'].obs.df()
obs.head()
## orig.ident nCount_RNA ... i.median_RNA i.RNA_quantile_0.9
## obs_id ...
## c_1_100_1 c 142.0 ... 110.0 690.6
## c_1_100_10 c 33.0 ... 110.0 690.6
## c_1_100_100 c 487.0 ... 110.0 690.6
## c_1_100_1000 c 117.0 ... 110.0 690.6
## c_1_100_1001 c 320.0 ... 110.0 690.6
##
## [5 rows x 67 columns]
Transcript coordinates are x-y locations of individual transcript. Note that one cell contains multiple transcripts. These are currently stored in obsm slot. In future releases transcript coordinates will be located here: pySoma['RNA'].uns["transcriptCoords"].uri
transcriptCoords = tiledb.open_dataframe(pySoma['RNA'].obsm["transcriptCoords"].uri, ctx=ctx)
transcriptCoords
In many cases, having all of the transcript coordinates in memory is too large. TileDB allows for querying of the data to significantly decrease the amount of data read into memory. Here we only want the transcripts in FOV 15.
with tiledb.open(pySoma['RNA'].obsm["transcriptCoords"].uri, mode="r", ctx=ctx) as A:
q = A.query(cond="fov == 15")
print(q.df[:,:])
## slideID fov x_FOV_px ... CellId cell_id CellComp
## 0 1 15 1375 ... 1271 c_1_15_1271 Nuclear
## 1 1 15 25 ... 1 c_1_15_1 Nuclear
## 2 1 15 25 ... 1 c_1_15_1 Nuclear
## 3 1 15 36 ... 1 c_1_15_1 Nuclear
## 4 1 15 167 ... 23 c_1_15_23 Cytoplasm
## ... ... ... ... ... ... ... ...
## 1409703 2 15 2226 ... 590 c_2_15_590 Membrane
## 1409704 2 15 2237 ... 590 c_2_15_590 Cytoplasm
## 1409705 2 15 2477 ... 278 c_2_15_278 Membrane
## 1409706 2 15 2497 ... 285 c_2_15_285 Cytoplasm
## 1409707 2 15 3143 ... 586 c_2_15_586 Cytoplasm
##
## [1409708 rows x 9 columns]
See R read in section for full list of analysis results and Analysis Results Section for full description of module that can be run in AtoMx SIP.
pcad = pySoma['RNA'].obsm['dimreduction_pca'].df()
pcad.head()
## PCA_1 PCA_2 PCA_3 ... PCA_48 PCA_49 PCA_50
## obs_id ...
## c_1_100_1 -1.641718 -2.331616 -2.998274 ... 0.208275 -0.535591 -0.390931
## c_1_100_10 -1.143402 -0.085282 -1.045539 ... 0.334928 0.035539 0.606723
## c_1_100_100 -7.013521 0.522009 0.094140 ... 1.244416 0.536143 1.016428
## c_1_100_1000 -0.223848 -2.221159 -0.781193 ... 2.616751 -0.784868 -0.295499
## c_1_100_1001 -3.051702 -4.481670 0.119539 ... -0.903006 -0.003687 1.911351
##
## [5 rows x 50 columns]
These data can easily be coerced into AnnData format that can be plugged into squidpy or other python packages.
coordinates = obs[["x_slide_mm", "y_slide_mm"]]
adata = AnnData(norm, obs = obs, obsm={"spatial": coordinates}, dtype = "float32")
adata
## AnnData object with n_obs × n_vars = 793318 × 1000
## obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_negprobes', 'nFeature_negprobes', 'RNA_pca_cluster_default', 'RNA_pca_cluster_default.1', 'nCount_falsecode', 'nFeature_falsecode', 'fov', 'Area', 'AspectRatio', 'Width', 'Height', 'Mean.PanCK', 'Max.PanCK', 'Mean.CK8.18', 'Max.CK8.18', 'Mean.Membrane', 'Max.Membrane', 'Mean.CD45', 'Max.CD45', 'Mean.DAPI', 'Max.DAPI', 'cell_id', 'assay_type', 'Run_name', 'slide_ID_numeric', 'Run_Tissue_name', 'Panel', 'Mean.Yellow', 'Max.Yellow', 'Mean.CD298_B2M', 'Max.CD298_B2M', 'cell_ID', 'x_FOV_px', 'y_FOV_px', 'x_slide_mm', 'y_slide_mm', 'propNegative', 'complexity', 'errorCtEstimate', 'percOfDataFromError', 'qcFlagsRNACounts', 'qcFlagsCellCounts', 'qcFlagsCellPropNeg', 'qcFlagsCellComplex', 'qcFlagsCellArea', 'median_negprobes', 'negprobes_quantile_0.9', 'median_RNA', 'RNA_quantile_0.9', 'nCell', 'nCount', 'nCountPerCell', 'nFeaturePerCell', 'propNegativeCellAvg', 'complexityCellAvg', 'errorCtPerCellEstimate', 'percOfDataFromErrorPerCell', 'qcFlagsFOV', 'cellType', 'niche', 'i.median_negprobes', 'i.negprobes_quantile_0.9', 'i.median_RNA', 'i.RNA_quantile_0.9'
## obsm: 'spatial'
Other documentation for accessing TileDB objects can be found here:
Because the TileDBsc native to_seurat function does not attach data from slots like uns we also export data in a native Seurat format. The AtoMx SIP export function has a Seurat converter that attaches all results to Seurat object if FullSeuratObject == TRUE, and future releases will be focused on expanding compatibility of TileDB data with Seurat.
If FullSeuratObject == FALSE, the object will contain count data, metadata, and dimension reduction results. Here we will walk through the resulting full Seurat object.
Working off of S3 paths are a little trickier than using local files. Here is an example of how to read in an RDS file from S3. If you are using local data you can simply use readRDS().
fullSeuratPath <- "LiverDataReleaseSeurat.RDS"
bucket <- strsplit(fullSeuratPath,"/")[[1L]][3L]
file_key <- strsplit(fullSeuratPath, paste0(bucket,"/"))[[1L]][2L]
system2(command = "aws", arg = c("s3api", "get-object","--bucket",bucket,
"--key", file_key,paste0("/tmp/tmp.RDS")),
stdout = FALSE)
fullSeurat <- readRDS("/tmp/tmp.RDS")
rm("/tmp/tmp.RDS")
# fullSeurat <- readRDS(fullSeuratPath)
fullSeurat
## An object of class Seurat
## 1197 features across 793318 samples within 3 assays
## Active assay: RNA (1000 features, 0 variable features)
## 2 other assays present: falsecode, negprobes
## 3 dimensional reductions calculated: pca, approximateumap, approximateUMAP_bySlide
While a Seurat object can be easier to use, please be aware that reading an entire dataset into memory can be RAM exhaustive. This full Seurat object, containing all of the count data, metadata, and analysis results data are 50.03 GB in size.
fullSeurat[["RNA"]]@counts[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
## c_1_100_10 c_1_100_1078 c_1_100_1135 c_1_100_267
## AATK . 1 . .
## ABL1 . 2 . .
## ABL2 . . . .
## ACACB . 2 . 1
These data were normalized using Pearson Residuals to account for library size factors to ensure that cell specific total transcript abundance and distribution of counts, which may vary some between FOVs and samples. See Normalization Section for more details.
One major difference between our custom toSeurat function is that the normalized data are in the expected location for later releases. It is located in the RNA assay under the data slot instead of in its own assay.
fullSeurat[["RNA"]]@data[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
## c_1_100_10 c_1_100_1078 c_1_100_1135 c_1_100_267
## AATK -0.06889529 1.5266149 -0.1577291 -0.3114469
## ABL1 -0.09509172 2.2435234 -0.2176835 -0.4296907
## ABL2 -0.08039924 -0.5759492 -0.1840596 -0.3633908
## ACACB -0.14233400 0.9318722 -0.3257528 0.9076156
fullSeurat@meta.data[1:4,1:10]
## orig.ident nCount_RNA nFeature_RNA nCount_negprobes
## c_1_100_10 c 33 26 0
## c_1_100_1078 c 1699 295 0
## c_1_100_1135 c 173 131 1
## c_1_100_267 c 675 204 2
## nFeature_negprobes RNA_pca_cluster_default
## c_1_100_10 0 13
## c_1_100_1078 0 7
## c_1_100_1135 1 12
## c_1_100_267 2 15
## RNA_pca_cluster_default.1 nCount_falsecode nFeature_falsecode fov
## c_1_100_10 15 5 5 100
## c_1_100_1078 11 7 7 100
## c_1_100_1135 10 9 8 100
## c_1_100_267 17 3 3 100
head(fullSeurat@misc$transcriptCoords)
## slideID fov x_FOV_px y_FOV_px z_FOV_slice target CellId cell_id
## 1 2 77 212 9 2 FalseCode23 460 c_2_77_460
## 2 2 58 565 9 6 FalseCode197 503 c_2_58_503
## 3 2 301 1160 9 4 FalseCode197 438 c_2_301_438
## 4 1 8 2183 9 7 FalseCode94 11 c_1_8_11
## 5 2 34 2364 9 0 FalseCode184 7 c_2_34_7
## 6 2 26 2467 9 3 FalseCode199 8 c_2_26_8
## CellComp
## 1 Cytoplasm
## 2 Nuclear
## 3 Cytoplasm
## 4 Cytoplasm
## 5 Membrane
## 6 Cytoplasm
In SMIDA v1.3, transcript coordinates in the Seurat object will be added to the Seurat@images slot to allow for easy integration to Seurat vignettes.
See Analysis Results Section for full description of module that can be run in AtoMx SIP. Many of the results names from an AtoMx export will have a globally unique identifier (GUID) attached to their name.
| slot | type of data | results |
|---|---|---|
| meta.data | cell metadata | |
| graphs | nearest neighbor graphs | graph_RNA_pca_snn, graph_RNA_pca_nn, spatial_distances, RNA_spatialDE_W |
| reductions | DimReduc objects | pca, approximateumap, approximateUMAP_bySlide |
| misc | additional data on cells | transcriptCoords, latest.fovs, PathwayCellsAUC, RNA_nbclust_loglikes, RNA_normalized_cluster_normExpr, cellScores, RNA_normalized_markers_result, RNA_normalized_top_markers, spatEnrichRes, cellTypeScores, cellTypePairs, linkScores, cellProximity_cellproximity_heatmap, cellProximity_cellproximity_pcf, cellProximity_cellproximity_which |
| assay@misc | additional data on target | RNA_LeesLRes, RNA_nbclust_log_likelyhood_profiles |
| assay@meta.features | target metadata |
In this section we will demonstrate how to perform several common actions using TileDB in R and python, as well as with Seurat in R.
This section shows how to plot a UMAP colored by the cell type metadata column. We create a simple plotting function to be used in multiple sections below (plotting).
colorColumn = "cellType"
Read in UMAP from TileDB R
umap_TileDB <- as.data.frame(tiledb_scdataset$somas$RNA$obsm$members$dimreduction_approximateumap$to_matrix())
umap_TileDB$colorBy <- tiledb_scdataset$somas$RNA$obs$to_dataframe(attrs = colorColumn)[[1L]]
Read in UMAP from TileDB python
umap_py = pySoma['RNA'].obsm['dimreduction_approximateumap'].df()
metadata = pySoma['RNA'].obs.df()
umap_py["colorBy"] = metadata[r.colorColumn]
Read in UMAP from Seurat
umap_Seurat <- as.data.frame(fullSeurat@reductions$approximateumap@cell.embeddings)
umap_Seurat$colorBy <- fullSeurat[[colorColumn]][match(rownames(umap_Seurat),
rownames(fullSeurat@meta.data)), 1]
colrs <- brewer.pal.info[brewer.pal.info$colorblind == TRUE, ]
colorSchemes <- c("PuOr", "Dark2", "Set2", "BrBG")
colrs <- colrs[colorSchemes,]
col_vec <- unique(unlist(mapply(brewer.pal, colrs$maxcolors, colorSchemes)))
col_vec <- col_vec[-grep(col_vec, pattern = "#F|#E|#D")]
plotting <- function(data, title, Xcol, Ycol, Xname, Yname, color,
colorName, size = 0.02, alpha = 0.05){
gp <- ggplot(data, aes(x = .data[[Xcol]], y = .data[[Ycol]],
color = .data[[color]]))+
geom_point(size = size, alpha = alpha)+
coord_equal()+
labs(title = title, color = colorName,
x = Xname, y = Yname)+
scale_color_manual(values = col_vec)+
guides(colour = guide_legend(override.aes = list(size=1,
alpha=1),
ncol = 1))+
theme(legend.text=element_text(size=12))
return(gp)
}
umapPlot <- function(data, title, colorBy = "colorBy", colorColumn = "Cell Type", ...){
return(plotting(data, title, Xcol = "APPROXIMATEUMAP_1",
Ycol = "APPROXIMATEUMAP_2", Xname = "UMAP1",
Yname = "UMAP2", color = colorBy, colorName = colorColumn,
...))
}
pcaPlot <- function(data, title, colorBy = "colorBy", colorColumn = "Cell Type", ...){
return(plotting(data, title, Xcol = "PCA_1",
Ycol = "PCA_2", Xname = "PCA_1",
Yname = "PCA_2", color = colorBy, colorName = colorColumn,
...))
}
xySlidePlot <- function(data, title, colorBy = "colorBy", colorColumn = "Cell Type", ...){
return(plotting(data, title, Xcol = "x_slide_mm",
Ycol = "y_slide_mm", Xname = "x_slide_mm",
Yname = "y_slide_mm", color = colorBy, colorName = colorColumn,
...))
}
All of the UMAPs are the same data regardless of how you read in the data.
| Readin method | Time | Memory |
|---|---|---|
| TileDB R | 4.991 seconds | 22.31 MB |
| TileDB python | 5.623 seconds | 19.04 MB |
| Full Seurat | 14.199 minutes | 50.03 GB |
umapGP_TileDB <- umapPlot(umap_TileDB, title = "TileDB - R")
umapGP_python <- umapPlot(reticulate::py$umap_py, title = "TileDB - python")
umapGP_Seurat <- umapPlot(umap_Seurat, title = "Seurat")
ggpubr::ggarrange(umapGP_TileDB,umapGP_python, umapGP_Seurat,
common.legend = TRUE, legend = "right", ncol = 1)
These are examples of current and future analysis modules available in AtoMx SIP. They provide analytical methods for adding results to a TileDB study. Please note that each study may only use a subset of these modules, and that users can create their own analysis modules which may have different formats and data specifications. Each section below shows how to work with each of the resulting features in the TileDB and Seurat objects (only available in v1.1 of AtoMx export). Manual can be found here.
Any module that was run before export, will be available in the TileDB and Seurat, when FullSeuratObject == TRUE. If FullSeuratObject == FALSE, the object will contain count data, metadata, and dimension reduction results. All modules can be run with either RNA and protein unless otherwise stated.
The results names generated by CosMx SMI Data Analysis differ slightly from the names in the liver dataset described here. Most results exported will have a GUID as part of the result name.
tiledb_scdataset$somas$RNA$obsseurat@meta.datatiledb_scdataset$somas$RNA$X$members$countsseurat[["RNA"]]tiledb_scdataset$somas$RNA$obsm$memebers$transcriptCoordsseurat@misc$transcriptCoordstiledb_scdataset$somas$RNA$varseurat[["RNA"]]@meta.featurestiledb_scdataset$somas$RNA$obsp$members$spatialdistances_[GUID]seurat@graphs$spatialdistances_[GUID]Negative Probe QC: can be used to remove negative control probes which appear to be behaving like outliers in your tissue. This helps prevent tissue-specific or sample-specific background effects from impacting other QC metrics.
Cell level QC: looks for cells which specific characteristics and removes them. These characteristics include: number of targets detected (higher is better), fraction of probes which are negative controls (lower is better), uniformity of count distribution (higher complexity is preferred), cell size (top % of cell sizes may need to be removed as a QC of segmentation). You may also manually provide a annotation to filter on if there is a cluster of cells you would like removed later in analysis.
FOV QC: identifies FOVs which have generally low expression. There are two approaches available. The “mean” method involves filtering out FOVs where the total count per cell averages below a threshold (default: 100). The “quantile” method involves filtering out FOVs where a specified quantile (default: 90th percentile) of gene expression is not above the median of the negative probes.
Target level QC: checks to see if a target appears to be above background across your dataset based on probe distribution relative to negative control probes included in the assay.
tiledb_scdataset$somas$RNA_normalized_[GUID]$X$members$dataseurat[["RNA_normalized_[GUID]"]]@datatiledb_scdataset$somas$RNA$obsm$members$dimreduction_approximatepca_[GUID]seurat@reductions$approximatepca_[GUID]pca <- as.data.frame(tiledb_scdataset$somas$RNA$obsm$members$dimreduction_pca$to_matrix())
pca$colorBy <- tiledb_scdataset$somas$RNA$obs$to_dataframe(attrs = colorColumn)[[1L]]
pcaPlot(pca, "PCA")
tiledb_scdataset$somas$RNA$obsm$members$dimreduction_approximateumap_[GUID]seurat@reductions$approximateumap_[GUID]umap_TileDB <- as.data.frame(tiledb_scdataset$somas$RNA$obsm$members$dimreduction_approximateumap$to_matrix())
umap_TileDB$colorBy <- tiledb_scdataset$somas$RNA$obs$to_dataframe(attrs = colorColumn)[[1L]]
umapPlot(umap_TileDB, "TileDB - R", size = 0.2)
tiledb_scdataset$somas$RNA$varp$members$RNA_nbclust_[GUID]_log_likelyhood_profilesseurat[["RNA"]]@misc$RNA_nbclust_[GUID]_log_likelyhood_profilesslideMetadata <- metadata[metadata$Run_Tissue_name == slideName &
metadata$fov %in% c(1:5, 22:26, 43:47),]
xySlidePlot(data = slideMetadata, title = "Cell Types in Space",
colorBy = "cellType", size = 0.3, alpha = 0.5)
tiledb_scdataset$somas$RNA$uns$members$[GUID]_celesta.*seurat@misc$[GUID]_celesta.* Many slots that contain [GUID]_celestatiledb_scdataset$somas$RNA$obsp$members$graph_nn_[GUID]_nn & tiledb_scdataset$somas$RNA$obsp$members$graph_nn_[GUID]_snnseurat@graphs$graph_nn_[GUID]_nn & seurat@graphs$graph_nn_[GUID]_snnxySlidePlot(data = slideMetadata, title = "Cell Clusters in Space",
colorBy = "RNA_pca_cluster_default",
size = 0.3, alpha = 0.5, colorColumn = "Cell Cluster")
xySlidePlot(data = slideMetadata, title = "Cell Neighborhoods in Space",
colorBy = "niche",
size = 0.3, alpha = 0.5, colorColumn = "Cell Niche")
tiledb_scdataset$somas$RNA$obsp$members$spatialDE_[GUID]_W, tiledb_scdataset$somas$RNA$varp$members$RNA_spatialDE_[GUID]_LeesLRes, & tiledb_scdataset$somas$RNA$varseurat@graphs$spatialDE_[GUID]_W, seurat[["RNA"]]@misc$RNA_spatialDE_[GUID]_LeesLRes, & seurat[["RNA"]]@meta.featuresLeesLRes <- tiledb_scdataset$members$RNA$varp$get_member("RNA_LeesLRes")$to_dataframe()
# reformat the names of genes (e.g. change 4-1BB to 4.1BB)
LeesLRes$var_id_i <- gsub('-', '.', LeesLRes$var_id_i)
LeesLRes$var_id_i <- gsub('/', '.', LeesLRes$var_id_i)
LeesLRes$var_id_j <- gsub('X4.1BB', '4.1BB', LeesLRes$var_id_j)
# convert to square matrix
namegenes <- sort(unique(unlist(LeesLRes[1:2])))
LeesLRes_matrix <- matrix(0, length(namegenes), length(namegenes),
dimnames = list(namegenes, namegenes))
LeesLRes_matrix[as.matrix(LeesLRes[c("var_id_i", "var_id_j")])] <-
LeesLRes[["value"]]
# heatmap
leesHeatmap <- function(spat_lees_l, n_clust) {
spat_l_hm <- pheatmap::pheatmap(as.matrix(spat_lees_l), scale="none",
color=viridis::viridis(100, option="D"),
border_color=NA, show_rownames=TRUE,
cluster_rows=TRUE, cluster_cols=TRUE,
cutree_rows=n_clust, cutree_cols=n_clust,
main="Lee's L Spatial Association")
return(spat_l_hm)
}
# leesHeatmap(LeesLRes_matrix, 10)
# we plot the top 20 genes with the largest magnitude off-diagonal values.
plot_n = 20
LeesLRes_off <- LeesLRes[LeesLRes$var_id_i != LeesLRes$var_id_j,]
LeesLRes_off <- LeesLRes_off[order(abs(LeesLRes_off$value),
decreasing = TRUE),]
plot_genes <- LeesLRes_off[!duplicated(LeesLRes_off$var_id_i),][1:plot_n,
"var_id_i"]
leesHeatmap(LeesLRes_matrix[plot_genes, plot_genes], 5)
tiledb_scdataset$somas$RNA$uns$members$cellproximity_[GUID]seurat@misc$cellproximity_[GUID]_cellProximity_heatmap & seurat@misc$cellproximity_[GUID]_cellProximity_pcf & seurat@misc$cellproximity_[GUID]_cellProximity_whichcell_pairs <- tiledb::toMatrix(tiledb_scdataset$members$RNA$uns$members$cellProximity$list_object_uris()[1L])
# generate heatmap
pheatmap::pheatmap(cell_pairs, cluster_cols = TRUE, cluster_rows = TRUE,
breaks = seq(-2, 2, length.out = 100),
color = colorRampPalette(c("blue", "white", "red"))(100),
show_rownames = TRUE, show_colnames = TRUE)
tiledb_scdataset$somas$RNA$obsm$members$markerID_[GUID]_cluster_normExpr & tiledb_scdataset$somas$RNA$obsm$members$markerID_[GUID]_markers_result & tiledb_scdataset$somas$RNA$obsm$members$markerID_[GUID]_top_markersseurat@misc$markerID_[GUID]_cluster_normExpr & seurat@misc$markerID_[GUID]_markers_result & seurat@misc$markerID_[GUID]_top_markers# Extract gene X cluster matrix for normalized expression of selected markers
cluster_normExpr <- tiledb_scdataset$somas$RNA$obsm$members$RNA_normalized_cluster_normExpr$to_matrix()
rownames(cluster_normExpr) <- rownames(tiledb_scdataset$somas$RNA$var$to_dataframe())
topMarkers <- as.data.frame(tiledb_scdataset$somas$RNA$obsm$members$RNA_normalized_top_markers$to_matrix())
topMarkers <- topMarkers$feats[as.numeric(aggregate(topMarkers$final_rank, list(topMarkers$cluster), max)$x)]
cluster_normExpr <- cluster_normExpr[rownames(cluster_normExpr) %in%
unique(topMarkers),]
# Heatmap
pheatmap::pheatmap(cluster_normExpr,
color = colorRampPalette(c("blue", "white", "red"))(100),
scale = 'row',
fontsize = 10, cellwidth = 10, border_color = NA,
show_rownames = TRUE, show_colnames = TRUE)
tiledb_scdataset$somas$RNA$uns$members$[GUID]_spatEnrichRes & tiledb_scdataset$somas$RNA$uns$members$[GUID]_cellTypePairs & tiledb_scdataset$somas$RNA$uns$members$[GUID]_linkScores& tiledb_scdataset$somas$RNA$uns$members$[GUID]_cellTypeScores & tiledb_scdataset$somas$RNA$obsm$members$[GUID]_cellScoresseurat@misc$[GUID]_spatEnrichRes & seurat@misc$[GUID]_cellTypePairs & seurat@misc$[GUID]_linkScores & seurat@misc$[GUID]_cellTypeScores & seurat@misc$[GUID]_cellScores## Retrieval of results
# For cellScores result from obsm folder
cellScores <- tiledb_scdataset$somas$RNA$obsm$members$cellScores$to_matrix()
# For LR results from UNS
uri_use <- tiledb_scdataset$members$RNA$uns$members$cellTypeScores$uri
cellTypeScores <- tiledb::tiledb_array(uri = uri_use, return_as = "data.table")[]
uri_use <- tiledb_scdataset$members$RNA$uns$members$cellTypePairs$uri
cellTypePairs <- tiledb::tiledb_array(uri = uri_use, return_as = "data.table")[]
cellTypePairs <- cellTypePairs[,-1]
cellTypePairs <- as.matrix(cellTypePairs[,-c("fromCT", "toCT"),with=FALSE],
rownames = cellTypePairs[,paste0(fromCT, "_", toCT)])
pheatmap::pheatmap(cellTypePairs[head(order(matrixStats::rowVars(cellTypePairs),
decreasing = TRUE), 25),
head(order(matrixStats::colVars(cellTypePairs),
decreasing = TRUE), 10)])
uri_use <- tiledb_scdataset$members$RNA$uns$members$spatEnrichRes$uri
spatEnrichRes <- tiledb::tiledb_array(uri = uri_use, return_as = "data.table")[]
maxPadj <- 0.05 #significance level
spatEnrichRes$PadjSig <- spatEnrichRes$padj < maxPadj
spatEnrichRes$lrname <- sapply(1:nrow(spatEnrichRes),
function(i){paste0(spatEnrichRes$Ligand[i], "/",
spatEnrichRes$Receptor[i])})
spatEnrichRes$intScoreRatio <- spatEnrichRes$intScore/spatEnrichRes$intScore_simAvg
ggplot(spatEnrichRes, aes(x=intScoreRatio, y=-log10(padj), color=PadjSig)) +
geom_point(size=2) +
labs(x = "L-R Interaction Score Ratio",
y = "-log10(Padj)",
color = "Significant",
title = paste0(" Padj < ", maxPadj)) +
geom_line(aes(y=-log10(maxPadj), color = "black"),linetype="dashed") +
scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
geom_text_repel(aes(label=lrname)) +
theme_bw()
tiledb_scdataset$somas$RNA$obsm$members$PathwayCellsAUC_[GUID]seurat@misc$PathwayCellsAUC_[GUID]pathway <- tiledb_scdataset$members$RNA$obsm$members$PathwayCellsAUC$to_matrix()
maxPath <- order(matrixStats::colSds(pathway[rownames(pathway) %in%
slideMetadata$cell_id,]),
decreasing = TRUE)[1L]
pathwayMeta <- cbind(metadata, umap_TileDB, maxPathway=pathway[,maxPath])
lims <- range(pathway[,maxPath])
maxPathName <- colnames(pathway)[maxPath]
plot_umap_pathway <- ggplot(pathwayMeta, aes(x = APPROXIMATEUMAP_1 ,
y = APPROXIMATEUMAP_2,
color = maxPathway)) +
geom_point(size = 0.3, alpha = 0.3) +
theme_bw() +
labs(color = maxPathName, title = "UMAP") +
viridis::scale_color_viridis(limits = lims, option = "A")
plot_xy_pathway <- ggplot(pathwayMeta[rownames(pathwayMeta) %in%
rownames(slideMetadata),],
aes(x_slide_mm, y_slide_mm)) +
geom_point(aes(color = maxPathway), size = 0.2, alpha = 0.3) +
coord_fixed() +
viridis::scale_color_viridis(limits = lims, option = "A") +
labs(color = maxPathName,
title = paste("Pathway Spatial Plot")) +
theme_bw()
(plot_umap_pathway | plot_xy_pathway ) +
plot_layout(guides = "collect", nrow = 2)
tiledb_scdataset$somas$RNA_normalized$uns$members$[GUID]_prede_* & tiledb_scdataset$somas$RNA_normalized$uns$members$[GUID]_deresults_*seurat@misc$[GUID]_preDE... & seurat@misc$[GUID]_de... Many slots that start with preDE and de.quantile_breaks <- function(xs, n = 10) {
breaks <- quantile(xs, probs = seq(0, 1, length.out = n))
breaks[!duplicated(breaks)]
}
all_de_results <- grep("deresults",
tiledb_scdataset$members$RNA_normalized$uns$list_member_uris(),
value = TRUE)
comparison <- "one.vs.rest" #select comparison
one.vs.rest <- tiledb::tiledb_array(
uri =grep(comparison, all_de_results, value = TRUE),
return_as = "data.table")[]
one.vs.rest <- one.vs.rest[which(one.vs.rest$p.value < 0.05),]
one.vs.rest_fc <- reshape2::acast(one.vs.rest, target~contrast,
value.var = "fold_change")
one.vs.rest_fc[is.na(one.vs.rest_fc)] <- 0
one.vs.rest$log10p <- -log10(one.vs.rest$p.value)
one.vs.rest_p <- reshape2::acast(one.vs.rest, target~contrast,
value.var = "log10p")
one.vs.rest_p[is.na(one.vs.rest_p)] <- 0
plot_list <- list()
mat_breaks <- quantile_breaks(one.vs.rest_fc, n = 11)
plot_list[[1L]] <- pheatmap::pheatmap(one.vs.rest_fc,
color = colorRampPalette(c("white", "blue"))(length(mat_breaks)-1),
fontsize = 10, cellwidth = 15, border_color = NA,
show_rownames = TRUE, show_colnames = TRUE,
main = "Fold Change", cluster_rows = FALSE,
silent = TRUE, breaks = mat_breaks)[[4L]]
plot_list[[2L]] <- pheatmap::pheatmap(one.vs.rest_p,
color = colorRampPalette(c("white", "red"))(length(mat_breaks)-1),
fontsize = 10, cellwidth = 15, border_color = NA,
show_rownames = TRUE, show_colnames = TRUE,
main = "-log10(p value)", cluster_rows = FALSE,
silent = TRUE, breaks = mat_breaks)[[4L]]
do.call("grid.arrange", c(plot_list, ncol =2))
If Raw Files are exported, they will follow this format. It is recommended to only export Raw Files once per study. If multiple pipelines from the same study export Raw File there could be TBs of duplicated files. For this study, the zipped raw files are over 800 GB. These files are not zipped automatically in the export module.
We will continue to refine and update files for export. In future releases, some of these files may be removed to cut down on exported file size.
If you have any questions please reach out to your Applications Scientist or NanoString Support: support@nanostring.com
For Research Use Only. Not for use in diagnostic procedures.
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rminiconda_0.0.1 RcppSpdlog_0.0.12 tiledbsc_0.1.4.9001
## [4] tiledb_0.17.0 patchwork_1.1.2 pheatmap_1.0.12
## [7] reshape2_1.4.4 reticulate_1.28 remotes_2.4.2
## [10] RColorBrewer_1.1-3 ggpubr_0.6.0 matrixStats_0.63.0
## [13] sp_1.6-0 SeuratObject_4.1.0 Seurat_4.1.1
## [16] ggrepel_0.9.3 gridExtra_2.3 ggplot2_3.4.1
## [19] pryr_0.1.6 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] backports_1.4.1 plyr_1.8.8 igraph_1.4.1
## [4] lazyeval_0.2.2 splines_4.2.3 listenv_0.9.0
## [7] scattermore_0.8 urltools_1.7.3 digest_0.6.31
## [10] htmltools_0.5.4 viridis_0.6.2 fansi_1.0.4
## [13] magrittr_2.0.3 tensor_1.5 cluster_2.1.4
## [16] ROCR_1.0-11 globals_0.16.2 spatstat.sparse_3.0-1
## [19] colorspace_2.1-0 lobstr_1.1.2 xfun_0.39
## [22] dplyr_1.1.0 jsonlite_1.8.4 progressr_0.13.0
## [25] spatstat.data_3.0-1 survival_3.5-3 zoo_1.8-11
## [28] glue_1.6.2 RcppCCTZ_0.2.12 polyclip_1.10-4
## [31] gtable_0.3.2 leiden_0.4.3 car_3.1-2
## [34] future.apply_1.10.0 spdl_0.0.4 abind_1.4-5
## [37] scales_1.2.1 rstatix_0.7.2 spatstat.random_3.1-4
## [40] miniUI_0.1.1.1 Rcpp_1.0.10 viridisLite_0.4.1
## [43] xtable_1.8-4 spatstat.core_2.4-4 bit_4.0.5
## [46] htmlwidgets_1.6.2 httr_1.4.5 ellipsis_0.3.2
## [49] ica_1.0-3 farver_2.1.1 pkgconfig_2.0.3
## [52] sass_0.4.5 uwot_0.1.14 deldir_1.0-6
## [55] here_1.0.1 utf8_1.2.3 labeling_0.4.2
## [58] tidyselect_1.2.0 rlang_1.1.0 later_1.3.0
## [61] munsell_0.5.0 tools_4.2.3 cachem_1.0.7
## [64] cli_3.6.0 generics_0.1.3 broom_1.0.4
## [67] ggridges_0.5.4 evaluate_0.20 stringr_1.5.0
## [70] fastmap_1.1.1 yaml_2.3.7 goftest_1.2-3
## [73] knitr_1.42 bit64_4.0.5 fs_1.6.1
## [76] fitdistrplus_1.1-8 purrr_1.0.1 RANN_2.6.1
## [79] pbapply_1.7-0 future_1.32.0 nlme_3.1-162
## [82] mime_0.12 compiler_4.2.3 rstudioapi_0.14
## [85] plotly_4.10.1 png_0.1-8 ggsignif_0.6.4
## [88] spatstat.utils_3.0-2 tibble_3.2.0 bslib_0.4.2
## [91] stringi_1.7.12 nanotime_0.3.7 highr_0.10
## [94] rgeos_0.6-2 lattice_0.20-45 Matrix_1.5-3
## [97] vctrs_0.6.0 pillar_1.9.0 lifecycle_1.0.3
## [100] BiocManager_1.30.20 triebeard_0.4.1 spatstat.geom_3.1-0
## [103] lmtest_0.9-40 jquerylib_0.1.4 RcppAnnoy_0.0.20
## [106] data.table_1.14.8 cowplot_1.1.1 irlba_2.3.5.1
## [109] httpuv_1.6.9 R6_2.5.1 bookdown_0.34
## [112] promises_1.2.0.1 KernSmooth_2.23-20 parallelly_1.34.0
## [115] codetools_0.2-19 MASS_7.3-58.2 rprojroot_2.0.3
## [118] rjson_0.2.21 withr_2.5.0 sctransform_0.3.5
## [121] mgcv_1.8-42 parallel_4.2.3 grid_4.2.3
## [124] rpart_4.1.19 tidyr_1.3.0 rmarkdown_2.20
## [127] carData_3.0-5 Rtsne_0.16 shiny_1.7.4